An Analytical Model for a Class of Architectures under Master-Slave Paradigm
Authors
Abstract
We build an analytical model for an application that uses the master-slave paradigm. The model uses only three architecture parameters: latency, bandwidth and flop rate. Instead of taking vendor-supplied or experimentally determined values, we estimate these parameters using the analytical model itself. Experimental results on the Cray T3E and SGI Origin 2000 indicate that this simple model gives fair predictions.
When building a performance model, it is crucial to capture the main factors that determine the behavior of the program in question. These factors are the parallelization strategy used, the amount of communication and computation, and the architecture of the parallel computer. The software employed, combined with the chosen message passing paradigm, plays a significant role in the effective values of the architecture parameters. A promising approach is to build a simplified model for a “real” application on a target architecture. The purpose of this paper is to build an analytical model that predicts the behavior of iterative numerical algorithms on a class of architectures using the master-slave paradigm.
Under the master-slave paradigm, the execution of a parallel program can be seen as a sequence of parallel and purely sequential phases. In a parallel phase the slaves compute concurrently; in a sequential phase only the master computes. Between these phases the master and slaves communicate, either through a single-node broadcast from the master or through single sends and receives between the master and one of the slaves. We analyze one of many possible parallel programs under this paradigm to show that, in certain cases, the influence of the factors mentioned above on performance can be quantified, and that a small number of architecture parameters can be defined for use in an analytical model. To estimate the architecture parameters and their interactions we make some assumptions.
The iterative algorithm used in the analysis is a block Jacobi algorithm for the solution of linear least squares problems, outlined by Dennis and Steihaug [2]. An important aspect of this algorithm is that the amount of communication and computation changes with the value of an algorithmic parameter p [4]. For the performance analysis of the algorithm we need to estimate the run time of one loop and the total number of loops. The run time of one loop, $t_{\mathrm{loop}}$, is estimated from two components: computation time and communication time.
M. Bubak et al. (Eds.): HPCN 2000, LNCS 1823, pp. 601–604, 2000. © Springer-Verlag Berlin Heidelberg 2000. Yasemin Yalçınkaya and Trond Steihaug.
When the number of processors at hand is smaller than the number of tasks, we assign more than one task to each slave. We assume that when a slave finishes sending its result to the master, the data is received instantaneously by the master, and the slave continues with its next task until it runs out of tasks. Then $t_{\mathrm{loop}}$ is
$$t_{\mathrm{loop}} = \max_{\forall\,\mathrm{slaves}} \left\{ \sum_{i=1}^{k_{\mathrm{slave}}} \left( t^{c}_{\mathrm{slave}}(i) + t^{s}_{\mathrm{send}}(i) \right) \right\} + t^{c}_{\mathrm{master}} + t_{\mathrm{broadcast}}, \qquad (1)$$
where $t^{c}_{\mathrm{slave}}(i)$ is the computation time of task $i$ on the slave, $t^{s}_{\mathrm{send}}(i)$ is the time to send the computation results from the slave to the master, $t^{c}_{\mathrm{master}}$ is the computation time on the master, $t_{\mathrm{broadcast}}$ is the time used by the master to broadcast data to the slaves, and $k_{\mathrm{slave}}$ is the number of tasks assigned to the slave.
The computation time is assumed to be proportional to the number of floating point operations (flops). The communication time is assumed to consist of two parts: a startup time (latency) and a part proportional to the amount of data sent. The proportionality factor depends on the number of processors in use and the interconnection topology. To analyze the computation time, we count the number of flops (additions and multiplications) for each task. We assume that a multiplication takes the same time as an addition.
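As a minimal sketch of Eq. (1), the loop time can be computed from per-task timings as shown below. The function name `loop_time`, the layout of its arguments and the sample timings are illustrative assumptions and are not taken from the original implementation.

```python
from typing import Sequence, Tuple

# One task is described by (computation time, send time), both in seconds.
Task = Tuple[float, float]


def loop_time(slave_tasks: Sequence[Sequence[Task]],
              t_master: float,
              t_broadcast: float) -> float:
    """Evaluate Eq. (1): slowest slave + master computation + broadcast.

    slave_tasks[j] lists the tasks assigned to slave j (hypothetical layout).
    """
    if not slave_tasks:
        raise ValueError("at least one slave is required")
    # Each slave works through its tasks one after another, sending each
    # result to the master; the loop waits for the slowest slave.
    slowest = max(sum(t_comp + t_send for t_comp, t_send in tasks)
                  for tasks in slave_tasks)
    return slowest + t_master + t_broadcast


# Hypothetical timings: slave 0 has two tasks, slave 1 has one.
example_tasks = [
    [(0.40, 0.05), (0.30, 0.05)],
    [(0.50, 0.05)],
]
print(loop_time(example_tasks, t_master=0.10, t_broadcast=0.02))
```

Here the first slave dominates the parallel phase, so its accumulated computation and send times determine the first term of Eq. (1).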
The values of $t^{c}_{\mathrm{master}}$ and $t^{c}_{\mathrm{slave}}$ are estimated by multiplying the respective flop counts by the inverse of the flop rate of the architecture in use.
In the implementation, MPI is employed as the message passing library. When the slaves begin sending their results, the master is already ready to receive them. We assume that the time the master spends retrieving arrived data from the buffer is negligible. We define $t_{\mathrm{send}} = l + \tau\beta\pi$, where τ is the message size in data elements, β = 8/BW is the transfer time per data element, l is the latency, and π is a variable depending on the number of hops between the sender and the receiver. BW is the bandwidth of the parallel computer architecture and 8 is the number of bytes in one data element. For simplicity, bandwidth is taken as constant in the model.
When there is only one slave, we assume that the master and slave processors are neighbors, so π = 1. On the SGI Origin 2000 the interprocessor communication network has a hypercube topology, and on the Cray T3E each PE has 6 neighbors [4]. The topological structure of the network, and hence the number of network links traversed by a message, is critical. When the number of slaves is increased we do not know the topology of the partition assigned to our program, but we assume that the partition is “dense”. The average distance between any two processors can then be approximated by $\frac{1}{2}\log_2 n_s$ on the Origin 2000 and by $\frac{3}{4} n_s^{1/3}$ on the Cray T3E, where $n_s$ is the number of slaves. Hence, for $n_s > 1$, π is $\frac{1}{2}\log_2 n_s$ on the Origin 2000 and $\frac{3}{4} n_s^{1/3}$ on the Cray T3E [1]. The single-node broadcast operation in MPI is implemented using a binomial tree approach [3]. Therefore, the broadcast time is proportional to $\log_2 n_s$.
We use only three parameters in the analytical model: flop rate, latency and bandwidth. However, instead of using vendor-supplied or experimentally determined values, these parameters are estimated using the analytical model itself. The variable π is not considered a parameter since it is determined by the topology of the communication network.
To estimate the values of the architecture parameters, we measure the execution time of an actual implementation on increasing problem sizes using one slave. Using only one slave lets us avoid interference from the rest of the system. Execution times for different problem sizes reflect the possible change in bandwidth. We take the algorithmic parameter p to be zero. The implementation is run several times for each problem size, and the average execution time for each problem is taken. We calculate the number of flops, count the number of send, receive and broadcast operations, and estimate the latency, bandwidth and flop rate of the architectures in question as the solution of a least squares problem. The estimated architecture parameters used in the model are given in Table 1. The parameter α in the table is the time to do one flop.
Table 1. Estimated values of architecture parameters
architecture      l          BW              β         α
SGI Origin 2000   22.69 μs   13.38 Mbyte/s   0.59 μs   7.42 ns
Cray T3E          74.16 μs   16.54 Mbyte/s   0.48 μs   8.67 ns
To validate the model we look at the difference between measured execution times and model predictions. In the figures, a higher test problem number means a larger problem size [4]. Figures 1 and 2 depict the model validation for execution times on the Cray T3E and SGI Origin 2000. On the Cray T3E the predictions are quite accurate, whereas on the SGI Origin 2000 we slightly overestimate the execution time as the problem size increases. The average deviation between the actual execution times and the model predictions is 8.6% of the actual execution times on the SGI Origin 2000 and 6.0% on the Cray T3E. Figure 3 displays the predicted versus actual execution time on the Cray T3E using two slaves and p = 1.
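The least squares estimation of the architecture parameters can be sketched as follows. The text does not spell out the regression, so this sketch assumes a one-slave model of the form T ≈ α·F + l·M + β·W, where F is the flop count, M the number of messages and W the number of data elements transferred. The function names, the operation counts and the use of normal equations are illustrative choices; a QR-based solver would be numerically preferable in practice.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular system: counts are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_parameters(counts, times):
    """Least squares fit of (alpha, latency, beta) via the normal equations.

    counts[k] = (flops, messages, data elements) for test problem k
    times[k]  = measured execution time of test problem k in seconds
    """
    # Scale each column so that flop counts do not swamp message counts.
    scales = [max(abs(row[j]) for row in counts) or 1.0 for j in range(3)]
    a = [[row[j] / scales[j] for j in range(3)] for row in counts]
    ata = [[sum(r[i] * r[j] for r in a) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(a, times)) for i in range(3)]
    scaled = solve_linear(ata, atb)
    return tuple(scaled[j] / scales[j] for j in range(3))


# Hypothetical operation counts for four test problems of increasing size.
example_counts = [(1e8, 100, 2e5), (4e8, 150, 5e5),
                  (9e8, 300, 7e5), (1.6e9, 350, 1.2e6)]
# Synthetic timings generated from the Origin 2000 row of Table 1.
alpha0, latency0, beta0 = 7.42e-9, 22.69e-6, 0.59e-6
example_times = [alpha0 * f + latency0 * m + beta0 * w
                 for f, m, w in example_counts]
print(fit_parameters(example_counts, example_times))
```

Because the synthetic timings are generated from the model itself, the fit should return values close to the ones used to generate them. With real measurements the residual of the fit indicates how well the three-parameter model describes the machine.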
When p = 1 the amount of communication remains the same as when p = 0, but the computation on the slaves is increased. In Fig. 4, we predict the execution time of the application with a different choice of p, p = Cs, using 4 slaves on the SGI Origin 2000. For the two largest test problems the overestimation observed in the model is magnified. There are two main differences between this case and the model problem: the computation on the slaves is increased along with the size of the messages sent by the slaves, and the number of slaves is quadrupled. These increases in the size of the system result in increased prediction error. The model can also be used to analyze the scalability of the application [4].
The estimated parameters do not reflect the architectural characteristics alone. The network load, the message passing paradigm used, the program code and the size of the problems are all factors that make the effective parameter values differ from the vendor-supplied values. This implies that the “estimated” latency is higher than the “machine constant latency”. Similarly, the estimated bandwidth is lower [4].
[Figure: execution time (s) versus test problems; Cray T3E, p = 0, slaves = 1]
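For multi-slave predictions such as the 4-slave case, the hop factor π and the send time l + τβπ from the model can be sketched as below. The function names and the architecture keys `"origin2000"` and `"t3e"` are hypothetical labels chosen for this illustration; the formulas are the ones stated in the text.

```python
import math


def hop_factor(n_slaves: int, architecture: str) -> float:
    """Average hop count pi between master and slave in a dense partition."""
    if n_slaves < 1:
        raise ValueError("n_slaves must be at least 1")
    if n_slaves == 1:
        # A single slave is assumed to be a neighbor of the master.
        return 1.0
    if architecture == "origin2000":   # hypercube network
        return 0.5 * math.log2(n_slaves)
    if architecture == "t3e":          # each PE has 6 neighbors
        return 0.75 * n_slaves ** (1.0 / 3.0)
    raise ValueError(f"unknown architecture: {architecture}")


def send_time(n_elements: int, latency: float, beta: float, pi: float) -> float:
    """Model send time l + tau * beta * pi, with tau in 8-byte data elements."""
    return latency + n_elements * beta * pi


# Example with the Origin 2000 values from Table 1 and a hypothetical
# message of 1000 data elements sent with 4 slaves in use.
pi_4 = hop_factor(4, "origin2000")
print(send_time(1000, latency=22.69e-6, beta=0.59e-6, pi=pi_4))
```

Keeping π separate from the fitted parameters mirrors the model, where the topology determines π and only latency, bandwidth and flop rate are estimated.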
Similar resources
The Master-Slave Paradigm on Heterogeneous Systems: A Dynamic Programming Approach for the Optimal Mapping
We study the master–slave paradigm over heterogeneous systems. According to an analytical model, we develop a dynamic programming algorithm that allows to solve the optimal mapping for such paradigm. Our proposal considers heterogeneity due both to computation and also to communication. The optimization strategy used allows to obtain the set of processors for an optimal computation. The computa...
Delay-dependent stability for transparent bilateral teleoperation system: an LMI approach
There are two significant goals in teleoperation systems: Stability and performance. This paper introduces an LMI-based robust control method for bilateral transparent teleoperation systems in presence of model mismatch. The uncertainties in time delay in communication channel, task environment and model parameters of master-slave systems is called model mismatch. The time delay in communicatio...
Asynchronous Traffic Signaling over Master-Slave Switched Ethernet protocols
Network protocol designers have always been divided between the adoption of centralized or distributed communication architectures. Despite exhibiting negative aspects like the existence of a single point-of-failure in the master as well as computational overhead and an inefficient handling of the asynchronous communications, Master-Slave protocols have always found their space mainly given the...
Analysis of Control Architectures for Teleoperation Systems with Impedance/Admittance Master and Slave Manipulators
A large number of bilateral teleoperation control architectures in the literature have been designed based on assumed impedance models of the master and slave manipulators. However, hydraulic or heavily geared and many other manipulators cannot be properly described by impedance models. In this paper, a common four-channel bilateral control architecture designed for the above impedance models i...
Modeling and measuring double-frequency jitter in one-way master–slave networks
doi:10.1016/j.cnsns.2008.09.009. The double-frequency jitter is one of the main problems in clock distribution networks. In previous works, some analytical and numerical aspects of this phenomenon were studied and results were obtained for one-way mas...